ggplot() and dplyr tutorial
Welcome to your first tutorial for this class, COMP/STAT 112: Introduction to Data Science! As you work through the different sections, there will be videos for you to watch (both embedded YouTube videos and links to the videos on Voicethread), files for you to download, and exercises for you to work through. The solutions to the exercises are usually provided, but in order to get the most out of these tutorials, you should work through the exercises and only look at the solutions if you get really stuck. You could also work through the exercises in your own R Markdown file in order to keep the results permanently. If you do that, start the file with the three code chunks I talk about below. Then copy and paste the questions into your document and put your solutions in R code chunks.
If you haven’t done so already, please go through the R Basics document.
When you start your own document, you should have the following three code chunks at the top of your R Markdown file:
knitr::opts_chunk$set(echo = TRUE, message=FALSE, warning=FALSE)
If a package is not yet installed, use the install.packages() function in the console and write the name of each of the packages you want to install. You only need to install packages once, although you will need to re-install them if you upgrade to a new version of R. You need to load them with the library() statements each time you use them. There is a good analogy with lights: installing the package is like putting the light in the socket, loading the package is like turning the light on.
library(tidyverse) # for graphing and data cleaning
library(googlesheets4) # for reading in data from googlesheets
library(lubridate) # for working with dates
library(palmerpenguins) # for palmer penguin data
theme_set(theme_minimal()) # my favorite ggplot theme
gs4_deauth() # skips google authorization
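The install-once, load-every-session distinction can also be checked from code. The sketch below is one possible approach and is not part of the original setup chunks: missing_packages() is a hypothetical helper name, and it relies on base R's requireNamespace() to test whether a package is installed without attaching it.

```r
# Hypothetical helper (not from the tutorial): return the names of packages
# that are not installed, without attaching any of them.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "googlesheets4", "lubridate", "palmerpenguins")
to_install <- missing_packages(needed)

# Report what is missing; the install itself is best run once in the console.
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}
```

The install.packages() call is left commented out on purpose, since installing belongs in the console rather than in a chunk that runs every time the document is knit.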
Data from a package can be loaded with the data() function. Data outside of a package can be loaded in different ways depending on where it is and what type of data it is. In the code below, I use the read_sheet() function from the googlesheets4 library to load data from my Google Drive.
# Palmer Penguins data from palmerpenguins library
data("penguins")
# Lisa's garden data on her google drive
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
  mutate(date = ymd(date))
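To see what the mutate(date = ymd(date)) step does, here is a minimal sketch on a made-up data frame. The toy_harvest data and its values are hypothetical, and the sketch assumes the dates arrive as text in year-month-day order, which may not be exactly how read_sheet() delivers them.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for a freshly loaded sheet: dates arrive as text.
toy_harvest <- tibble(
  vegetable = c("lettuce", "radish"),
  date      = c("2020-06-06", "2020-06-07"),
  weight    = c(20, 36)
)

# ymd() parses year-month-day input into R's Date type.
toy_harvest_dates <- toy_harvest %>%
  mutate(date = ymd(date))

class(toy_harvest$date)        # text before the conversion
class(toy_harvest_dates$date)  # Date after the conversion
```

Once the column is a Date, date arithmetic and date axes in plots behave as expected.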
Before jumping into teaching you some Data Science skills in R, I want to give you some motivation. I picked three graphs I’ve recently seen on Twitter. These are all responses to #TidyTuesday which you’ll be participating in very soon! Read more about it here if you’re curious. There are many definitions of Data Science but I broadly like to think of it as using data to tell a story. These three graphs are just a small sample of doing just that.
One of my favorite Data Visualizers on Twitter:
And here's the #makingof of this week's #TidyTuesday submission #dataviz #rstats https://t.co/zPvjs4KdaH pic.twitter.com/iqTuOFpP4b
— Georgios Karamanis (@geokaramanis) April 18, 2020
One of my former students (and your preceptor!):
This wk's @R4DScommunity #TidyTuesday: guess what a centered dot-plot of astronauts in space by year and nation looks a lot like?
— lil bobby tables 🐳 (@robert_b_) July 15, 2020
A space station in mid-orbit (or Cloud City)! #RStats #r4ds #DataScience #DataViz #tidyverse #ggplot2 pic.twitter.com/hqW7KLWmsn
A #TidyTuesday newcomer:
My first #TidyTuesday! I decided to K.I.S.S. and focus on aesthetic for my first week. Thank you @kllycttn for pointing me to the futurevisions palettes!
— Kelly Morrow McCarthy (@KellyMM_neuro) August 20, 2020
GitHub: https://t.co/S5YP0pFlvq
futurevisions: https://t.co/h0dfUYFOqi pic.twitter.com/7hqsz7cwdb
After this tutorial, you should be able to do the following.
- Construct graphs using ggplot2 functions.
- Use dplyr functions to begin “wrangling” data.
- Pipe (%>%) together a sequence of dplyr functions to answer a question.
- Combine dplyr verbs and ggplot() functions to wrangle and plot data.

We will use two different datasets throughout this tutorial.
The Palmer Penguins dataset is from the palmerpenguins library. The data we will use is called penguins. You can read about it within R by typing ?penguins in the console.
Let’s do some basic exploration of the data. The code below uses the dim() function to find the dimensions of the dataset - the number of rows and columns.
dim(penguins)
## [1] 344 8
And we use the head() function to view the first 6 rows of the data.
head(penguins)
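The same exploration functions work on any data frame. As a small sketch, here they are on mtcars, a dataset built into R that I am using only as a stand-in so the example runs without loading any packages.

```r
# mtcars ships with base R, so no packages are needed here.
dim(mtcars)      # number of rows, then number of columns
nrow(mtcars)     # number of cases only
ncol(mtcars)     # number of variables only
head(mtcars, 3)  # first 3 rows instead of the default 6
```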
This dataset, which I named garden_harvest in the Set up, contains data that I have collected (and am still collecting) from my personal garden this summer. You can view the google sheet here. Each row in the data is a “harvest” for a variety of a vegetable. So, vegetables might have multiple rows on a day, especially if they are things I eat twice a day (lettuce) or there are many different varieties of the vegetable (tomatoes).
I fondly refer to my garden as the “Jungle Garden” because by the end of the summer all the plants are creeping out of their beds and it can be quite the adventure walking through it. Take a look at the video below for an in-depth tour of the garden and details around how I collect the data.
Voicethread: Jungle Garden tour
Let’s also get an overview of this dataset.
Use the dim() function to find the number of cases and variables in the dataset.
Use the glimpse() function to show the first few cases of each of the variables and see the type of variable.
ggplot()
Now, let’s get ready to plot some data! The concept map below provides an overview of the functions you will be learning, how they relate to one another, and what they do.
First, watch the video below that introduces the ggplot() syntax.
Voicethread: Intro to ggplot()
Next, watch the video below that walks through some examples in RStudio. You can practice along with me by downloading the R Markdown file and working through the problems. If you do that, you will likely get somewhat different results than you see in the video when using the garden_harvest data because the data is still changing every day :)
Lastly, watch this short video about common mistakes. Hopefully you won’t make them, but admittedly I sometimes still do.
Voicethread: ggplot() mistakes
Now you have the tools you need to begin creating your own plots. As you work through these exercises, it will be helpful to have the Data Visualization with ggplot2 cheatsheet open. Find the cheatsheet here or, from within RStudio, go to Help –> Cheatsheets and click on Data Visualization with ggplot2.
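As a reminder of the general shape of a ggplot() call, here is a minimal sketch. It uses the built-in mtcars data as a stand-in (not one of the tutorial datasets), and the title and axis labels are my own choices for illustration.

```r
library(ggplot2)

# mtcars is a built-in dataset, used here only as a stand-in.
p <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +  # data + aesthetic mapping
  geom_point(color = "steelblue") +                           # geom; color is fixed, not mapped
  labs(title = "Fuel efficiency by weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

p  # printing the object draws the plot
```

Settings placed inside aes() are mapped to variables in the data, while settings placed outside aes(), like the color above, apply the same value to every point.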
Use the penguins data to create a scatterplot of bill_length_mm (x-axis) vs. bill_depth_mm (y-axis). I have started the code for you. How would you describe the relationship?
penguins %>%
  ggplot(___(x = ___,
             y = ___)) +
  geom____()
Now use the code you wrote in the previous exercise but color the points by species. How does this change how you described the relationship before?
Now use the code you wrote in the previous exercise but make the points smaller and more transparent.
Create a histogram of the flipper_length_mm.
Add a facet to the previous histogram so there is a different histogram for each species. Make it so there is one column of plots. How would you compare the distributions?
Create a barplot that shows the number of penguins for each year. Fill in the bars with the color lightblue.
The code below creates a new dataset called tomatoes. Use the tomatoes dataset to create a barplot that shows the number of days that each tomato variety has been harvested. Make the bars horizontal, fill them in with the color tomato4, and order them from most to least (hint: use fct_infreq() and fct_rev()). Also give the plot nice labels.
tomatoes <- garden_harvest %>%
  filter(vegetable == "tomatoes")
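If the two functions in the hint are unfamiliar, this small sketch shows what each one does to the order of a factor's levels. The factor f is a made-up example, not tutorial data.

```r
library(forcats)

# Hypothetical toy factor: "c" appears 3 times, "b" twice, "a" once.
f <- factor(c("a", "b", "b", "c", "c", "c"))

levels(f)                       # alphabetical by default
levels(fct_infreq(f))           # most frequent level first
levels(fct_rev(fct_infreq(f)))  # the frequency order, reversed
```

The level order matters because ggplot2 draws the levels of a factor in order along the axis, and on a vertical axis the first level sits at the bottom.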
Use boxplots to compare the flipper_length_mm by species. Make the boxplots horizontal. How does this graph compare to the faceted histogram you made above? What are the strengths and weaknesses of each type of graph?
The code below creates a dataset (tomatoes_wt_date) that has the weight in grams of tomatoes (daily_wt_g) for each date. Use that to create a line graph of the weight of tomatoes harvested each day.
tomatoes_wt_date <- garden_harvest %>%
  filter(vegetable == "tomatoes") %>%
  group_by(date) %>%
  summarize(daily_wt_g = sum(weight))
The code below creates a dataset (tomato_variety_daily) that has the weight in grams of each variety of tomato (daily_wt_g) for each date. Use that to create a line graph of the weight of tomatoes harvested each day, where there is a separate line for each variety, in a different color. What are some ways you might improve this graph?
tomato_variety_daily <- garden_harvest %>%
  filter(vegetable == "tomatoes") %>%
  group_by(date, variety) %>%
  summarize(daily_wt_g = sum(weight))
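One detail worth knowing when grouping by two variables: by default summarize() removes only the last grouping variable, so the result stays grouped by the first one and dplyr prints a message saying so. The sketch below, on made-up data, shows one way to get an ungrouped result with the .groups argument. The toy data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for garden_harvest.
toy <- tibble(
  date    = as.Date(c("2020-07-01", "2020-07-01", "2020-07-01", "2020-07-02")),
  variety = c("grape", "grape", "Big Beef", "grape"),
  weight  = c(100, 50, 300, 80)
)

daily <- toy %>%
  group_by(date, variety) %>%
  summarize(daily_wt_g = sum(weight), .groups = "drop")  # return an ungrouped result

daily
```

Adding ungroup() after summarize() is another common way to reach the same ungrouped result.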
dplyr functions
Next, you will learn how to wrangle and manipulate data using six dplyr functions. There are many other functions we can use (and we will!) but these six will get us pretty far, especially when combined. The concept map below shows the six functions I will introduce and what they are used for.
First, watch the video below that introduces the dplyr functions.
To recap, the six main dplyr verbs are summarized below.
Images from RStudio Cheatsheets: https://rstudio.com/resources/cheatsheets/
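For a quick text version of the six verbs, here is a minimal sketch that applies each one to a tiny made-up data frame. The orders data and its column names are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, not from the tutorial.
orders <- tibble(
  item  = c("apple", "pear", "apple", "fig", "pear"),
  qty   = c(3, 1, 5, 2, 4),
  price = c(0.5, 0.8, 0.5, 1.2, 0.8)
)

orders %>% select(item, qty)                # select(): keep some columns
orders %>% mutate(cost = qty * price)       # mutate(): add a new column
orders %>% filter(qty > 2)                  # filter(): keep rows that meet a condition
orders %>% arrange(desc(qty))               # arrange(): sort rows
orders %>% summarize(total_qty = sum(qty))  # summarize(): collapse to one row

# group_by() makes the verbs that follow work within each group.
item_totals <- orders %>%
  group_by(item) %>%
  summarize(total_qty = sum(qty))

item_totals
```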
This table shows the comparison and logical operators often used with the filter() verb.
| Operator | Meaning |
|---|---|
| == | Equal to |
| > | Greater than |
| < | Less than |
| >= | Greater than or equal to |
| <= | Less than or equal to |
| != | Not equal to |
| %in% | in |
| is.na | is a missing value (NA) |
| !is.na | is not a missing value |
| & | and |
| \| | or |
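Here is a short sketch of several of these operators inside filter(). The fruit data is made up for the illustration and includes one missing weight so the NA behavior is visible.

```r
library(dplyr)

# Hypothetical toy data with one missing weight.
fruit <- tibble(
  name   = c("apple", "pear", "fig", "plum", "kiwi"),
  weight = c(150, 180, NA, 60, 75)
)

fruit %>% filter(weight >= 75)                   # comparison
fruit %>% filter(name %in% c("apple", "kiwi"))   # membership in a set of values
fruit %>% filter(!is.na(weight))                 # drop rows with a missing weight
fruit %>% filter(name != "pear" & weight < 100)  # both conditions must hold
fruit %>% filter(name == "fig" | weight > 160)   # either condition may hold
```

filter() keeps only the rows where the condition is TRUE, so a row whose condition evaluates to NA is dropped, as happens to the fig row in the first example.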
Next, watch the video below that walks through some examples in RStudio. Just like with the ggplot() material, you can practice the dplyr problems along with me by downloading the R Markdown file and working through them. If you do that, you will likely get somewhat different results than you see in the video when using the garden_harvest data because the data is still changing every day :)
Select vegetable, date, and weight from the garden_harvest data. I have started the code for you below.
garden_harvest #What do I need to put here?
  select()
mutate()
Add a variable for weight in kilograms, weight_kg. One kilogram is 1000 grams. I started the code below.
garden_harvest #What do I need to put here?
  mutate()
mutate()
Keep the weight_kg variable from the previous problem and also add a variable to the garden_harvest data called day_of_week that returns the day of the week. HINT: Use the function wday() and add an argument to that function that is label=TRUE.
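To see what wday() returns with and without the label argument, here is a small sketch on a single date. The date is my own choice (July 4, 2020, a Saturday), and the comments assume lubridate's default setting where the week starts on Sunday.

```r
library(lubridate)

d <- ymd("2020-07-04")  # a Saturday

wday(d)                              # day of week as a number; by default Sunday is 1
wday(d, label = TRUE)                # abbreviated day name, as an ordered factor
wday(d, label = TRUE, abbr = FALSE)  # full day name
```

Because the labeled version is an ordered factor, plots that use it show the days in calendar order rather than alphabetical order.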
filter()
Filter the garden_harvest data to observations that have weights less than 50 grams.
filter()
Filter the garden_harvest data to peas and beans with weights larger than 40 grams.
arrange()
Order the observations in the garden_harvest data from largest to smallest weight.
arrange()
Order the observations in the garden_harvest data from largest to smallest weight on each date.
summarize()
Find the total weight in grams and how many rows of data are in the garden_harvest data.
summarize() with group_by()
Find the total weight in grams harvested for each date.
dplyr verbs
I love tomatoes. Well, truthfully, I love things made out of tomatoes - spaghetti sauce, salsa, soups, and even ketchup. I always want to know which variety of tomato is most productive. In this exercise, start with the garden_harvest data, filter to tomatoes, find the total weight for each variety, compute a new variable to convert the weights from grams to pounds, and lastly sort the data from largest to smallest total weight in pounds. Which variety is best? Is there any information missing?
dplyr verbs and ggplot()
I’m curious if there are certain days during the week where I harvest more or less. In this exercise, start with the garden_harvest data, find the daily harvest in grams for each date, and create two new variables: (1) the daily harvest in pounds and (2) the day of the week. Then plot the data so that there is a boxplot of the daily harvest in pounds for each day of the week (on the y-axis).
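The general pattern of wrangling with dplyr and then piping the result straight into ggplot() can be sketched on made-up data. Everything named here (picks, crop, grams, grams_per_pound) is hypothetical, the conversion uses the approximate factor of 453.592 grams per pound, and the bar chart is only one of many plots that could follow the wrangling.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, not the garden data.
picks <- tibble(
  crop  = c("kale", "kale", "beets", "beets", "beets"),
  grams = c(200, 253.592, 400, 300, 207.184)
)

grams_per_pound <- 453.592  # approximate conversion factor

crop_totals <- picks %>%
  group_by(crop) %>%
  summarize(total_g = sum(grams)) %>%
  mutate(total_lb = total_g / grams_per_pound) %>%
  arrange(desc(total_lb))

# The wrangled data frame is piped into ggplot(); layers are then added with +.
crop_plot <- crop_totals %>%
  ggplot(aes(x = total_lb, y = crop)) +
  geom_col() +
  labs(x = "Total harvest (lb)", y = NULL)

crop_plot
```

Note the switch from %>% between dplyr steps to + between ggplot layers, which is an easy place to slip.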